Marking-up multiple views of a Text: Discourse and Reference

Dan Cristea
Department of Computer Science, University "A.I. Cuza" Iasi
Iasi, 6600 Romania
dcristea@infoiasi.ro

Nancy Ide
Department of Computer Science, Vassar College
Poughkeepsie, NY, USA
ide@cs.vassar.edu

Laurent Romary
Loria-CNRS
B.P. 239, F-54506 Vandoeuvre Lès Nancy
romary@loria.fr

Abstract

We describe an encoding scheme for discourse structure and reference, based on the TEI Guidelines and the recommendations of the Corpus Encoding Specification (CES). A central feature of the scheme is a CES-based data architecture enabling the encoding of and access to multiple views of a marked-up document. We describe a tool architecture that supports the encoding scheme, and then show how we have used the encoding scheme and the tools to perform a discourse analytic task in support of a model of global discourse cohesion called Veins Theory (Cristea, Ide and Romary, forthcoming).

1 Introduction

Recent work on discourse processing has demonstrated the need for large corpora annotated for relational structures in discourse (Cristea and Webber, 1997; Marcu, 1997a). Although corpora marked for discourse structure are beginning to exist,[1] they are typically marked using ad hoc encoding formats designed to accommodate a specific piece of software and/or research need. No coherent, consistent and, above all, standardized encoding scheme for discourse structure currently exists. As a result, available corpora commonly require considerable effort before they are generally usable for discourse study.

We have taken a more principled approach to the development of an encoding scheme for discourse structure annotation. Our work grows out of our own need for corpora annotated for discourse structure and reference. We describe elsewhere (Cristea, Ide and Romary, forthcoming) an approach to long-distance reference resolution, called Veins Theory (VT), that demonstrates the relation between discourse cohesion and coherence and discourse structure. VT is centered around the identification of “veins” over discourse structure trees such as those defined in Rhetorical Structure Theory (RST) (Mann and Thompson, 1987). To validate our theory, it is necessary to test it on a large sample of real data that is annotated both for discourse structure and reference. However, no existing scheme currently supports this kind of markup to the extent required for our work. Therefore, we devised an encoding scheme that provides for reference annotation and allows for encoding discourse structure, which both eliminates interference between the two encodings and supports automatic extension.

In this paper, we describe our annotation scheme, realized in an SGML/XML[2] format compatible with the Text Encoding Initiative (TEI) Guidelines (Sperberg-McQueen & Burnard, 1994) and the Corpus Encoding Specification (CES) (Ide, 1998). The scheme is based on recognized standards and is therefore likely to be reusable with different software systems. To support our scheme, we propose a data architecture that enables multiple views of a document[3] (based on the CES scheme outlined in Ide 1998), and a reference linkage system based on Bruneseaux and Romary (1997). These schemes have been developed with an eye toward flexibility and extendibility, in order to be of the widest possible use. In particular, the data architecture enables access to different annotations of a corpus with minimal processing overhead, and allows the simultaneous representation of different (and sometimes incompatible) annotations of the same data. We have tested the scheme by applying it to a small corpus in English, French, and Romanian, and subsequently used it for our research on VT.[4]

In section 2, we describe a tool architecture supporting our encoding scheme. In section 3, we provide an overview of the encoding conventions, and in section 4 we give a brief description of VT and demonstrate how the annotated corpora have been used to validate this theory.

[1] See, for example, the Discourse Resource Initiative at http://www.georgetown.edu/luperfoy/Discourse-Treebank/dri-home.html
[2] XML is the Extensible Markup Language, which is likely to become the successor of SGML.
[3] This data architecture has been adopted for a system of corpus-processing tools (LT NSL) available from Edinburgh University; see McKelvie et al. (forthcoming).
[4] Our results using this test data are described in Cristea & Ide (forthcoming).

2 The Annotation Architecture

We have defined a multi-level (hierarchical) parallel annotation architecture, compatible with the data architecture defined in the CES, that accommodates different annotation views of the same document. In our scheme, a "hub" document (HD) contains markup for basic document structure down to the level of paragraph, as well as (possibly) some sub-paragraph markup for sentence segmentation and/or special tokens such as names, dates, etc. The HD is referenced via inter-document links by a family of documents, each containing an additional view (AD) of the HD.

The overall architecture is that of a directed acyclic graph (DAG) with the HD as its root, thereby disallowing circular addressing. All documents in the hierarchy represent annotations made from different perspectives of the same original hub document. The markup from all parents is combined in the child document. The inheriting system is non-monotonic.

[Figure 1: Mixed manual-automatic annotation with GLOSS. Box labels: Source SGML, GLOSS session, DB image, Automatic annotation.]

To implement this view-based scheme, an annotation tool called GLOSS (Cristea, Craciun and Ursu, forthcoming) was developed with the following features:

• SGML compatibility: the annotator takes as input both plain texts and SGML documents paired with their DTDs.[5] At any point in the annotation process, the document can be saved in SGML format;
• database image copy: during the annotation process, an internal representation of the markup is kept in an associated database. When an annotation session is finished, the associated database can be saved for interrogation purposes. Queries addressed to the database can be expressed in SQL;[6]
• manual/automatic annotation: once a database image of a document exists, it is used as input for a subsequent annotation session with GLOSS. This enables enriching of certain types of tags using an automatic procedure, as outlined in Figure 1;
• multiple parentage/multiple views: GLOSS allows for the unification of the database representations of the declared parent documents. Therefore, when a document inherits from two or more parent documents, another database is generated that copies common parts from these parents and adds the markup that is specific to each of them. Once the parentage relations are established (which occurs when a new view is created), the document loses all connection with its parent documents, such that modifications can be made to the new document without affecting the originals;
• non-monotonic behavior: because each document is associated with its own database, the user can perform modifications as follows:
    • creation of a new view defined to inherit from one or more parent views;
    • addition, modification or deletion of attribute-value pairs on elements inherited from parent views, without affecting the view defined by the markup in the ancestor;
    • addition of new elements together with their attributes, read-accessible to any inferior view;
    • deletion of inherited elements without affecting the parent view;
• interactive discourse structure annotation: annotating the discourse structure in GLOSS is an interactive visual process that aims at creating a binary tree (Marcu, 1997; Cristea and Webber, 1997), where intermediate nodes are relations and terminal nodes are units. Experience gained by the authors in manual annotation of discourse structure trees reveals that an incremental, unit-by-unit evolution precisely mimicking an automatic expectation-based parsing (Cristea and Webber, 1997) is not compulsory during a manual process. Manual annotation is closer to a trial-and-error, island-driven process. To facilitate the building of the tree structure, GLOSS allows development of partial trees that can subsequently be integrated into existing structures by adjoining or substitution.

The principal advantage of this architecture is that it accommodates independent views of the same SGML document. As such, different teams with different expertise can work independently of one another on the same original document, each accomplishing different annotation tasks. Later, by simply declaring the resulting documents as parent views, GLOSS will combine the different annotations into a single document, retaining only one instantiation of common markup.

[5] The current implementation allows for a simplified DTD syntax.
[6] The current implementation does not enable database interrogation within the annotator.

3 Overview of the Encoding Conventions

The encoding conventions that we adopt for reference annotation and discourse structure are based upon a simple but important principle of separation of segmental and relational markup. Segmental markup includes elementary identification of the units of interest for a given study (e.g., referring strings, discourse units, etc.). Relational markup identifies structural constraints between these units (e.g., co-referential links, discourse relations, etc.). Separation of these two types of markup has the following advantages:

• segmental information is likely[7] to be theory-independent and consensual, whereas the nature and number of relations will change depending on the approach to reference (strict co-referential view, anaphoric chains, etc.);
• our annotation architecture enables multiple relational encodings for the same segmental level, thus providing potentially several perspectives on the same text;
• separation of the two types of markup implies two phases in the annotation process of a given document, thus enabling better evaluation of results from each phase.

In our scheme the segmental markup is realized as follows: reference strings are marked using the <RS> tag, as described in detail in Bruneseaux and Romary (1997), while for discourse structure the TEI/CES <SEG> element with attribute TYPE=UNIT and a unique ID is used. The <RS> tags are nested inside <SEG> elements.

Relational markup, which identifies structural relationships among segments (e.g., co-referential links, RST relations among discourse units), is encoded using the TEI/CES <LINK> element with a unique ID and the TARGETS attribute marking the list of two daughters (we have adopted a binary tree representation for the discourse structure, as in Marcu (1997) and Cristea and Webber (1997)). A third attribute, NUCLEI, enables the identification of the daughter nuclei. <LINKGRP> elements group <LINK> elements that comprise part of the same level of annotation. The overall encoding structure is illustrated by the following:[8]

    [Element tags are not preserved in this copy; only element content and the surviving attributes are shown.]
    FIRST UNIT
    SECOND UNIT
    THIRD UNIT
    FOURTH UNIT
    ... TARGETS="U1 L2" NUCLEI="U1"/>
    ... TARGETS="L3 U4" NUCLEI="L3 U4"/>

For example, consider the following fragment[9] (referring expressions are indexed with their bracketed IDs for readability):

    u1  Il existe quelque chose de plus épouvantable que ne l'est l'abandon du père[p65] par ses deux filles[p66], qui[p67] le[p68] voudraient mort.
    u2  C'est la rivalité des deux soeurs[p69] entre elles[p70].
    u3  Restaud[p71] a de la naissance,
    u4  sa femme[p72] a été adoptée,
    u5  elle[p73] a été présentée;
    u6  mais sa soeur, sa riche soeur, la belle Madame Delphine De Nucingen[p74], femme d'un homme d'argent[p74a], meurt de chagrin;
    u7  la jalousie la[p75] dévore,
    u8  elle[p76] est à cent lieues de sa soeur[p77];
    u9  sa soeur[p78] n'est plus sa soeur[p79];
    u10 ces deux femmes[p80] se[p81] renient entre elles[p82] comme elles[p83] renient leur père[p84].

The marked-up version of this fragment is as follows:

    [Element tags are not preserved in this copy; only element content and the comment lines of the link section are shown.]
    IL EXISTE QUELQUE CHOSE DE PLUS EPOUVANTABLE QUE NE L'EST L'ABANDON DU PERE PAR SES DEUX FILLES, QUI LE VOUDRAIENT MORT
    C'EST LA RIVALITÉ DES DEUX SOEURS ENTRE ELLES
    RESTAUD A DE LA NAISSANCE,
    SA FEMME A ÉTÉ ADOPTÉE,
    ELLE A ETE PRESENTEE;
    MAIS SA SOEUR, SA RICHE MADAME DELPHINE DE NUCINGEN, FEMME D'UN HOMME D'ARGENT, MEURT DE CHAGRIN;
    LA JALOUSIE LA DÉVORE,
    ELLE EST À CENT LIEUES DE SA SOEUR;
    SA SOEUR N'EST PLUS SA SOEUR;
    CES DEUX FEMMES SE RENIENT ENTRE ELLES COMME ELLES RENIENT LEUR PÈRE

    ;; Pere Goriot's daughters[10]
    ;; Pere Goriot
    ;; Mme de Restaud
    ;; Mme de Nucingen
    ;; Relation type links

As this example shows, we currently base our linkage mechanisms on the TEI extended pointer mechanisms. However, we are exploring the use of the pointer mechanism defined by the WWW Consortium using XML (Maler & DeRose, 1998), which is inspired by the TEI Guidelines and amenable to support by a wide range of software.

[7] Well known problems at this level include inclusion of complements in referring units, marking of verb phrases, etc.
[8] For clarity and brevity, the example includes annotations “collapsed” with the Hub Document to form a single SGML document rather than a graph of interrelated documents, as outlined in section 4. However, in reality the different types of markup are included in separate SGML documents.
[9] From Honoré de Balzac, Le Père Goriot.
[10] Our comments.

4 Application of the Architecture to Structure-Reference Study

In Cristea, Ide and Romary (forthcoming), we propose an approach to long-distance reference resolution that demonstrates the relation between discourse cohesion and coherence and discourse structure. Our model, which we call Veins Theory (VT), is centered on the identification of “veins” over RST-like discourse structure trees. The fundamental assumption underlying VT is that an inter-unit reference is possible only if the two units are in a structural relation with one another. In Cristea, Ide and Romary (forthcoming) we describe the means by which veins are computed over discourse structure trees and then define domains of accessibility derived from the veins. Accessibility domains for any node in a discourse structure tree may include units which are sequentially distant in the text stream, and thus long-distance references (including those requiring “jumps” over units or segments that contain syntactically feasible referents) can be accounted for. Thus our model provides a description of global discourse cohesion, which significantly extends the model of local cohesion provided by Centering Theory (CT) (Grosz, Joshi, and Weinstein 1995).

The domain of accessibility of a unit is defined as the string of unit labels appearing in its vein expression and preceding that unit label. The main conjecture of VT is that references from a given unit are possible only in its domain of accessibility. Therefore, in VT reference domains for any node may include units that are sequentially distant in the text stream, and thus long-distance references (including those requiring “return-pops” (Grosz, 1977; Fox, 1987) over segments that contain syntactically feasible referents) can be accounted for.

A smoothness score for a discourse segment can be computed by attaching an elementary score to each transition between sequential units according to Table 2, summing up the scores for each transition in the entire segment, and dividing the result by the number of transitions in the segment. This provides an index of the overall coherence of the segment.

    Table 2: Smoothness scores for transitions

    Transition                  Score
    CENTER CONTINUATION         4
    CENTER RETAINING            3
    CENTER SHIFTING (SMOOTH)    2
    CENTER SHIFTING (ABRUPT)    1
    NO Cb                       0

A global CT smoothness score can be computed by adding up the scores for the sequence of units making up the whole discourse, and dividing the result by the total number of transitions (number of units minus one). In general, this score will be slightly higher than the average of the scores for the individual segments, since accidental transitions at segment boundaries will be included. Analogously, a global VT smoothness score can be computed using accessibility domains to determine transitions rather than sequential units. Using this data, we can then compare the smoothness scores using CT and VT.

We claim that the global smoothness score of a discourse when computed following VT is at least as high as the score computed following CT. To validate this claim and VT in general, we used the above annotation scheme to encode a small corpus of texts in English, French, and Romanian. The following texts were included in our analysis:

• three short English texts, RST-analyzed by experts (source: Daniel Marcu, described in Marcu) and subsequently annotated for reference and Cf lists by the authors;
• a fragment from Honoré de Balzac’s Le Père Goriot (French), previously annotated for co-reference (Bruneseaux and Romary, 1997); RST and Cf list (see below) annotation made by the authors;
• a fragment from Alexandru Mitru’s “Legendele Olimpului”[11] (Romanian); structure, reference, and Cf lists annotated by one of the authors.

As described in section 3, the encoding marks referring expressions, links between referring expressions (co-reference or functional), units, relations between units (if known), and nuclearity. We also include an attribute to encode forward-looking centers (Cf), comprising a list of referring expressions, and backward-looking centers (Cb), which consist of a single <RS>.[12] As centers are semantic entities, we have identified a center with a chain of surface co-references, therefore an element with TYPE=COREF. Any ID of the chain of co-reference links can be used to identify the semantic entity. With this convention Cb's can be computed automatically. A program[13] does this and also the following:

• builds the tree structure of units and relations between them;
• adds to each referring expression the index of the unit it occurs in;
• computes the heads and veins for all nodes in the structure;
• determines the accessibility domains of the terminal nodes (units);
• counts the number of direct and indirect references in order to validate VT.

The hierarchy of views encoded in the documents is given in Figure 2.

[Figure 2: The hierarchy of views for the validation of VT. Nodes: BD, RS-VIEW, U-VIEW, RL-VIEW, RS-IN-U-VIEW, REL-VIEW, VEINS-VIEW, CF-VIEW, CT-VIEW, VT-VIEW.]

The views include:

• BD: the base document, containing the unannotated text and possibly markup for basic document structure down to the level of paragraph.
• RS-VIEW: includes markup for isolated reference strings. The basic elements are <RS>'s (reference strings).
• RL-VIEW: the reference links view, imposed over the RS-VIEW, includes reference links between an anaphor, or source, and a referee, or target. Links configure co-reference chains, but can also indicate bridge references (Strube and Hahn, 1996; Passonneau, 1994, 1996).
• U-VIEW: marks discourse units (sentences, and possibly clauses). Units are marked as <SEG> elements with TYPE=UNIT.
• REL-VIEW: reflects the discourse structure in terms of a tree-like representation.
• VEINS-VIEW: includes markup for head and vein expressions. HEAD and VEIN attributes (with values comprising lists) are added to all <SEG> and <LINK> elements.
• RS-IN-U-VIEW: inherits <SEG> and <RS> elements from U-VIEW and RS-VIEW. It also includes markup that identifies the discourse unit to which a referring string belongs.
• CF-VIEW: inherits all markup from RS-IN-U-VIEW, and adds a list of forward-looking centers (the CF attribute) to each unit in the discourse.
• CT-VIEW (Centering Theory view): inherits Cf lists from the CF-VIEW and backward references from the RL-VIEW. Using the markup in this view, first Cb's (the CB-C[14] attribute of the <SEG> elements) and then transitions can be computed following classical CT, that is, between sequential units. A global smoothness score following CT is finally computed.
• VT-VIEW (Veins Theory view): inherits Cf centers from the CF-VIEW, back-references from the RL-VIEW, and vein expressions from the VEINS-VIEW. The VT-VIEW also includes markup for Cb's computed along the veins of the discourse structure (the CB-H[15] attribute of the <SEG> elements). Transitions are computed following VT, and then a VT smoothness score.

The results are partly summarized in Table 1, which shows that the score for VT is better than that for CT in all cases. A complete analysis of the investigations performed in order to validate VT is given in Cristea & Ide (forthcoming).

    Table 1: CT smoothness scores vs VT smoothness scores

    Source     No. of        CT score   Average CT score   VT score   Average VT score
               transitions              per transition                per transition
    English     59            76        1.25                84        1.38
    French      47           109        2.32               116        2.47
    Romanian    65           142        2.18               152        2.34
    Total      173           327        1.89               352        2.03

[11] “The Legends of Olympus”
[12] In CT each unit is associated with a list of forward-looking centers (Cf list), where elements are partially ranked according to discourse salience, and a unique backward-looking center (Cb), that is, the first center in the Cf list of the previous unit that is also realized in the current unit.
[13] Written in Java.
[14] From "classical".
[15] From "hierarchical".

5 Conclusion

In this paper we outline an encoding scheme and a data architecture for discourse, together with a set of tools that support the annotation of corpora. We have used these tools to annotate corpora in English, French, and Romanian and used them to study a model of discourse cohesion based on Veins Theory. Our results demonstrate that VT provides a promising approach to identifying domains of referential accessibility in discourse.

There is, at present, no encoding standard for discourse. The few annotated corpora available are encoded using a variety of formats, which in turn often demands re-encoding when these corpora are used with different pieces of software. In our view, it is essential not only to determine a standard for encoding discourse, but also to define a data architecture which is maximally flexible. The view-based architecture and inheritance mechanism described in this paper provide a viable framework for discourse encoding, which allows the representation of a variety of types of annotation and can accommodate different theories and perspectives. We are currently exploring the extension of our scheme to support multi-lingual analyses; this should be readily accomplished using linkage mechanisms similar to those described here to associate parallel text passages.

References

Bruneseaux, F. & Romary, L. (1997). Codage des références et coréférences dans les dialogues homme-machine. Proceedings of the Joint International Conference of the Association for Computers and the Humanities and the Association for Literary and Linguistic Computing, Kingston, Ontario.
Cristea, D., Ide, N. & Romary, L. (forthcoming). Veins Theory: A Model of Global Discourse Coherence and Cohesion.
Cristea, D., Craciun, O. & Ursu, C. (forthcoming). GLOSS: A Visual Interactive Tool for Discourse Annotation.
Cristea, D. & Webber, B.L. (1997). Expectations in Incremental Discourse Processing. Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (pp. 88-95), Madrid.
Fox, B. (1987). Discourse Structure and Anaphora. Written and Conversational English. Cambridge Studies in Linguistics, 48. Cambridge University Press.
Grosz, B.J. (1977). The Representation and Use of Focus in Dialogue Understanding. Ph.D. Dissertation, University of California, Berkeley.
Grosz, B.J., Joshi, A.K. & Weinstein, S. (1995). Centering: A Framework for Modeling the Local Coherence of Discourse. Computational Linguistics, 21(2), 203-225.
Hahn, U. & Strube, M. (1997). Centered Segmentation: Scaling Up the Centering Model to Global Discourse Structure. Proceedings of EACL/ACL'97 (pp. 104-111), Madrid.
Ide, N. (1998). Corpus Encoding Standard: SGML Guidelines for Encoding Linguistic Corpora. First International Language Resources and Evaluation Conference, Granada, Spain (this volume). See also http://www.cs.vassar.edu/CES/
Maler, E. & DeRose, S. (1998). XML Pointer Language (XPointer). WWW Consortium Working Draft, 3 March 1998, http://www.w3c.org/TR/WD-xptr
Mann, W.C. & Thompson, S.A. (1987). Rhetorical Structure Theory: A Theory of Text Organisation. Text, 8(3), 243-281.
Marcu, D. (1997). The Rhetorical Parsing of Natural Language Texts. Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (pp. 96-103), Madrid.
Marcu, D. (1997a). The Rhetorical Parsing, Summarization, and Generation of Natural Language Texts. Ph.D. Dissertation, University of Toronto.
McKelvie, D., Brew, C. & Thompson, H.S. (1998). Using SGML as a Basis for Data-Intensive Natural Language Processing. Forthcoming in Computers and the Humanities (in press).
Passonneau, R.J. (1994). Protocol for coding discourse referential noun phrases and their antecedents. Technical Report, Columbia University.
Passonneau, R.J. (1996). Using Centering to Relax Gricean Informational Constraints on Discourse Anaphoric Noun Phrases. Language and Speech, 39(2-3), 229-264.
Sperberg-McQueen, C.M. & Burnard, L. (Eds.) (1994). Guidelines For Electronic Text Encoding and Interchange. ACH-ACL-ALLC Text Encoding Initiative, Chicago and Oxford.
Strube, M. & Hahn, U. (1996). Functional Centering. Proceedings of ACL '96, Santa Cruz.
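
The smoothness score defined in section 4 (elementary scores taken from Table 2 and averaged over the transitions of a segment or of a whole discourse) can be sketched in Python as follows. This is an illustrative sketch only: the names TRANSITION_SCORES and smoothness and the sample transition sequence are assumptions introduced here, and the program used in the study (written in Java) is not reproduced. The same function would serve for a CT or a VT score; the two differ only in how the sequence of transitions is obtained (sequential units versus accessibility domains), which is outside this sketch.

```python
# Elementary score for each transition type, as listed in Table 2.
TRANSITION_SCORES = {
    "CENTER CONTINUATION": 4,
    "CENTER RETAINING": 3,
    "CENTER SHIFTING (SMOOTH)": 2,
    "CENTER SHIFTING (ABRUPT)": 1,
    "NO Cb": 0,
}


def smoothness(transitions):
    """Return the average elementary score over a sequence of transition labels."""
    if not transitions:
        raise ValueError("at least one transition is required")
    total = sum(TRANSITION_SCORES[label] for label in transitions)
    return total / len(transitions)


# Hypothetical five-unit discourse, hence four transitions between units.
example = [
    "CENTER CONTINUATION",
    "CENTER RETAINING",
    "NO Cb",
    "CENTER SHIFTING (SMOOTH)",
]
print(smoothness(example))  # (4 + 3 + 0 + 2) / 4 = 2.25
```

Applied to the totals reported in Table 1, the same division gives, for example, 109 / 47 ≈ 2.32 for the French text under CT.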